;  _  4  S  S  e  ~  rr  *  —  S  Ci  mi  m  k  *•  I  *'  9  d 


REPORT  DOCUMENTATION  PAGE 


n  8  E  o 


pifC  j  READ  INSTRUCTIONS 

BEFORE  COMPLETING  FORM 


|2  GOVT  tCCCSSION  MO  I  »  RECIPIENT'S  Catalog  NUMBER 


928 


*  *  t  *  ..  £  i  and  subunit 


1  TYPE  Of  *l»0»T  A  •C»IOO  COVERED 


Parallel  Algorithms  for  Computer  Vision 
on  the  Connection  Machine 


Memo 


PERFORMING  ORG  report  number 


I  '  »u’hO»ii, 

James  J.  Little 


contract  or  grant  numbeRiij 


DACA76-85-C-00 1 0 
N00014-85-K-01 24 


10.  PROGRAM  Element  frojEC’  t  ask 
AREA  t  WORK  UNIT  NUMBERS 


I  CONTROLLING  OFFICE  NAME  AND  AOORESS 

Advanced  Research  Projects  Agency 
1400  Wilson  Blvd. 

Arlington,  VA  22209 


monitoring  agency  name  A  AOORESSfll  dlllmrmtt  Irom  Con 

Office  of  Naval  Research 
Information  Systems 
Arlington,  VA  22217 


12.  REPORT  DATE 

November  1986 


IV  NUMBER  OF  RASES 

30 


Inlllnf  Ol/lem)  IS.  SECURITY  CLASS  Id  l hit  tmp»n) 


UNCLASSIFIED 


ISa  DECLASSIFICATION/  DOWNGRADING 
SCHEDULE 


Distribution  is  unlimited 


IT  distribution  STATEMENT  (ml  m»  aWafraci  antaratf  In  BlacS  10,  II  Olllmtmt I  Irtm  * 


[  20  ABSTRACT  /Canunwa  an  raaaraa  alWa  II  nacaaaarr  RnB  IWanill y  Or  klacA  IMP*NJ 


The  Connection  Machine  is  a  fine-grained  parallel  computer  having  ud  to  64K 
processors.  It  supports  both  local  communication  among  the  processors, 
which  are  situated  in  a  two-dimensional  mesh,  and  high-bandwidth  communication 
among  processors  at  arbitrary  locations,  using  a  message-passing  network.  We 
present  solutions  to  a  set  of  Image  Understanding  problems  for  the  Connection 
Machine.  These  problems  were  proposed  bv  DARPA  to  evaluate  architectures  for 
Image  Understanding  svstems,  and  are  intended  to  comprise  a  representative  — -> 


DD  ,:°:“t,  1473  EDITION  OF  I  MOV  SS  It  OBSOLETE 
s/n  o :o  2- o I*- e«o  i 


UNCLASSIFIED 

SECURITY  CLASSIFICATION  OF  THIS  PAGE  !**••*»  Dpi*  !>»>•*• 


IS  supplementary  notes 

°*jQ 

None 

"  -m 

if  «cf  Y  WO  WO  J  «n  «/#•  tt  n«c«a««r  wi^  W««* 

Computer  vision;  Computational  geometry;  Parallelism 

MASSACHUSETTS  INSTITUTE  OF  TECHNOLOGY 
ARTIFICIAL  INTELLIGENCE  LABORATORY 


A. I.  Memo  928  November  1986 

PARALLEL  ALGORITHMS  FOR  COMPUTER  VISION 
ON  THE  CONNECTION  MACHINE 

James  J.  Little 

ABSTRACT:  The  Connection  Machine  is  a  fine-grained  parallel  computer 
having  up  to  64K  processors.  It  supports  both  local  communication  among 
the  processors,  which  are  situated  in  a  two-dimensional  mesh,  and  high- 
bandwidth  communication  among  processors  at  arbitrary  locations,  using  a 
message-passing  network.  We  present  solutions  to  a  set  of  Image  Understand¬ 
ing  problems  for  the  Connection  Machine.  These  problems  were  proposed 
by  DARPA  to  evaluate  architectures  for  Image  Understanding  systems,  and 
are  intended  to  comprise  a  representative  sample  of  fundamental  procedures 
to  be  used  in  Image  Understanding.  The  solutions  on  the  Connection  Ma¬ 
chine  embody  general  methods  for  filtering  images,  determining  connectivity 
among  image  elements,  determining  spatial  relations  of  image  elements  and 
computing  graph  properties,  such  as  matchings  and  shortest  paths. 


Acknowledgements.  This  report  describes  research  done  within  the  Arti¬ 
ficial  Intelligence  Laboratory  at  the  Massachusetts  Institute  of  Technology. 
Support  for  the  A.l.  Laboratory’s  artificial  intelligence  research  is  provided 
in  part  by  the  Advanced  Research  Projects  Agency  of  the  Department  of 
Defense  under  Army  contract  number  DACA70-85  C  0010  and  in  part  by 
DARPA  under  Office  of  Naval  Research  contract  N 000 14  85  K  0124. 

''c)  Massachusetts  Institute  of  Technology  1986 

67  .!  J  •: 


1  Introduction 


»  M-  »  *_4  J 


Q 


•  v' 

r  c  -  .*  • 
v  ^  ✓ 


a 


Several  problems  for  vision  research  were  proposed  for  a  DARPA  Workshop 
on  Parallel  Architectures  for  Image  Understanding.  This  document  describes 
the  design  and  implementation  of  solutions  to  these  problems  on  the  Con¬ 
nection  Machine^.  We  describe  the  Connection  Machine  and  its  features 
which  permit  fast  parallel  solutions  to  these  problems.  Then,  we  describe 
each  problem  and  present  its  solution.  In  each  case,  we  provide  an  est  rnate 
of  the  running  times  for  the  sample  problems  on  the  current  version  of  the 
Connection  Machine. 

1.1  The  Connection  Machine 

The  Connection  Machine  [Hi!lis85]  is  a  powerful  fine-grained  parallel  machine 
having  between  16K  and  64K  processors,  operating  under  a  single  instruction 
stream  broadcast  to  all  processors  (figure  1).  It  is  a  Single  Instruction  Mul¬ 
tiple  Data  (SIMD)  machine,  because  all  processors  execute  the  same  control 
stream.  Each  of  the  processors  is  a  simple  1-bit  processor,  currently  with  4K 
bits  of  memory.  There  are  two  modes  of  communication  among  the  proces¬ 
sors:  first,  the  processors  are  connected  by  a  mesh  of  wires  into  a  128  x  512 
grid  network  (the  NEWS  network,  so-called  because  the  connections  are  in 
the  four  cardinal  directions),  allowing  rapid  direct  communication  between 
neighboring  processors,  and,  second,  the  router,  which  allows  messages  to  be 
sent  from  any  processor  to  any  other  processor  in  the  machine.  The  proces¬ 
sors  in  the  Connection  Machine  can  be  envisioned  as  being  the  vertices  of  a/ 
lG-dimensionai  hypercube  (in  fact,  it  is  a  12-dimensional  hypercube;  at  each 
vertex  of  the  hypercubo  resides  a  chip  containing  16  processors).  Figure  2 
shows  a  4-dimensional  hypercube;  each  processor  is  connected  by  4  wires 
to  other  processors.  Each  processor  in  the  Connection  Machine  is  identified 
by  a  unique  integer  in  the  range  0...  65535,  its  hypercube  address,  impos¬ 
ing  a  linear  order  on  the  processors.  This  address  identifies  the  processor 
for  message-passing  by  tin  router.  Messages  pass  along  the  edges  of  the 
hypercube  from  source  processors  to  destination  processors.  An  operation 
where  messages  are  transmitted  among  the  processors  using  the  router  will 
be  termed  a  send  operation.  In  addition  to  local  operations  in  the  processors, 
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Figure  1:  Block  Diagram  of  the  Connection  Machine 

the  Connection  Machine  can  return  to  the  host  machine  the  result  of  various 
operations  on  a  field  in  all  processors;  it  can  return  the  global  maximum, 
minimum,  sum,  logical  AND,  logical  OR  of  the  field. 

To  manipulate  data  structures  with  more  than  64K  elements,  the  Connec¬ 
tion  Machine  provides  virtual  processors.  A  single  physical  processor  operates 
as  a  set  of  multiple  virtual  processors  by  serializing  operations  in  time,  and 
dividing  the  memory  of  each  processor  accordingly.  This  is  otherwise  in¬ 
visible  to  the  user.  The  number  of  virtual  processors  assigned  to  a  physical 
processor  is  denoted  by  the  virtual  processor  ratio  (VP  ratio),  which  is  always 
>  1.  When  the  VP  ratio  is  strictly  greater  than  1,  the  Connection  Machine 
is  necessarily  slowed  down  by  that  factor,  in  most  operations. 

1.2  Powerful  Primitive  Operations 

Many  of  the  problems  investigated  here  must  be  solved  by  a  combination 
of  communication  modes  on  the  Connection  Machine.  The  design  of  these 
algorithms  takes  advantage  of  the  underlying  architecture  of  the  machine  in 
novel  ways.  There  are  several  common,  elementary  operations  used  in  this 
discussion  of  parallel  algorithms.  Sorting,  for  example,  of  all  8-bit  pixel  values 


It 

4^1 


sd 


Figure  2:  4-dimensional  Hypercube 

in  a  512  x  512  image  (VP  of  4:1)  takes  approximately  30  ms.  A  256  X  256 
image  (VP  1:1)  can  be  sorted  in  approximately  10  ms.  This  operation  is 
primitive,  and  is  useful,  because  of  its  power  and  speed. 

1.2.1  Scanning 

The  scan  operation  is  a  primitive,  global  operation  that  uses  the  hypercube 
connections  underlying  the  router  to  distribute  values  among  the  proces¬ 
sors  of  the  Connection  Machine,  scan  takes  a  binary  associative  opera¬ 
tor  @,  with  identity  0,  an  ordered  set  [oo,ai,  • . . ,  Qn-i]  and  returns  the  set 
[a0,  (a0  ©  at), .  .  . ,  (a0  ©  ax  ©  . . .  ©  a„_[)].  The  scan  operations  implement  the 
abstract  operation  known  as  parallel  prefix  [Blelloch86],  Binary  associative 
operations  include  min,  max,  and  plus.  A  max-scan  operation  stores,  in  the 
destination  field  of  the  nth  processor,  the  maximum  value  of  the  source  field  of 
all  processors  0 . .  .  n  1 .  This  is  very  rapid  (<  1  ms)  and  can  be  very  useful. 
Other  operations,  such  as  plus-scan  have  been  implemented.  The  enumerate 
operation  assigns  a  unique  non-negative  integer  to  all  selected  processors,  in 
the  order  of  their  cube-addresses,  using  plus-scan  on  processors  with  initial 
value  unity.  The  copy-scan  operation  takes  a  value  at  the  first  processor  and 
distributes  it  to  the  following  processors. 
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processor-number  =[01234567] 


A  =[5  134392  6] 

Plus -Scan( A)  =  [5  6  9  13  16  25  27  33] 

Max-Scan(A)  =[55555999] 

Figure  3:  Examples  of  Plus-Scan  and  Max-Scan. 

scan  operations  also  work  in  the  NEWS  addressing  scheme,  termed  it. 
grid-scans.  These  allow  one  to  take  the  sum,  find  the  maximum,  copy,  or 
number  values  along  rows  or  columns  of  the  NEWS  grid  auickly.  'Hie  scan 
operations  take  segment  bits  that  divide  the  processor  ordering  into  segments. 
The  beginning  of  each  segment  is  marked  by  a  processor  whose  segment  bit 
is  set;  when  the  scan  operation  encounters  a  segment  bit  which  is  set,  it 
restarts  the  scan  process.  Time  for  scan  operations  are,  for  example,  200  ps 
for  enumerate,  and  350  ns  for  plus-scan  on  an  8-bit  field.  Figure  3  shows  the 
results  of  plus-scan  and  max-scan  operating  on  some  example  data. 

1.2.2  Distance  Doubling 

Another  important  primitive  operation  is  distance  doubling  [Lim86],  which 
can  he  used  to  compute  the  effect  of  any  binary,  associative  operation,  as 
in  scan,  on  processors  linked  in  a  list  or  ring.  For  example,  using  max, 
doubling  can  propagate  the  extremum  of  a  field  in  all  processors  in  the  ring 
in  D(logA')  steps,  where  IV  is  the  number  of  processors  in  the  ring.  Each 
step  ill’-  olves  two  send  operations.  Typically,  the  value  to  be  maximized  is  the 
cube-address  (a  unique  integer  identifier)  of  the  processor.  At  termination, 
each  processor  in  the  ring  knows  the  label  of  the  maximum  processor  in 
the  ring.  h»reafter  termed  the  principal  processor.  This  serves  to  label  all 
connect'd  processors  uniquely  and  to  nominate  a  particular  processor  'the 
princ.tji.il)  as  the  representative  for  the  entire  set  of  connected  processors. 
Figure  t  shows  the  propagation  of  value-  in  a  ring  of  eight  processor-  Ear  h 
prore-  ■  > r  initially,  at  step  0.  has  an  address  of  the  n*  xt  processor  in  'he  ring, 
and  .i  whir  h  E  to  be  maximized.  At  the  termination  of  I ) . .  ’ep.  a 

prof'  -  :  knows  1 1  ■  *  .id dresses  of  processor  2'  •  1  away  and  tin  iiu  of 
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Figure  1:  Distance  Doubling:  Each  box  contains  (left, right  address)  above, 
and  value  below. 

all  values  within  2,_1  processors  away.  In  the  example,  the  maximum  value 
has  been  propagated  to  all  8  processors  in  log  8  =  3  steps. 

1.3  Rules  of  the  Game 

In  analyzing  the  problems  described  here,  output  operations  have  sometimes 
been  included,  but  input  operations  have  been  neglected.  The  justification  for 
this  is  that  a  vision  system  using  a  parallel  processor  such  as  the  Connection 
Machine  should  maintain  its  data  structures  as  long  as  possible  in  the  parallel 
computer.  Transfers  to  and  from  a  serial  host  should  be  avoided  as  often  as 
possible. 

Several  of  the  problem  specifications  state  that  the  input  is  in  the  form 
of  real  numbers.  In  particular,  the  benchmarks  on  Geometric  Constructions 
and  Triangle  Visibility  use  real-valued  coordinates.  The  benchmark  on  edge 
detection  can  be  understood  to  require  real  numbers  for  the  entries  in  the 
“Laplacian”  operator.  The  Connection  Machine,  however,  has  bit-serial  pro¬ 
cessors  and  hence  has  no  fixed  w'ord  length.  It  is  extremely  easy  then  to 
compute  with  indefinite  length  integers;  our  implementation  of  convolution 
uses  this  feature,  so  we  do  not  use  real  numbers  in  smoothing  the  image  for 
edge  detection.  The  only  other  problems  in  which  real  numbers  are  not  used 
are  the  Voronoi  Diagram  and  Euclidean  Minimum  Spanning  Tree  (EMST) 
example;  in  the  first,  the  data  are  assumed  rounded  to  integer  values  so  that 


the  mesh  connections  in  the  Connection  can  be  used  for  brush-tire  propaga¬ 
tion,  and  the  EMST  depends  on  the  Voronoi  Diagram.  All  other  examples 
assume  real  arithmetic  when  necessary. 

The  parallel  computing  environment  at  the  MIT  A1  Lab  consists  o!  a 
Connection  Machine  ! Hillis85l  with  16K  processors,  with  a  Symbolics  3G50 
Lisp  Machine  as  host.  Connection  Machine  programs  utilize  Lisp  syntax,  in  a 
language  called  *Lisp  ,Lasser86].  Statements  in  'Lisp  programs  are  compiled 
and  manipulated  in  the  same  fashion  as  Lisp  statements,  contributing  signif¬ 
icantly  to  t fie  ease  of  programming  the  Connection  Machine.  The  experience 
at  MIT  in  using  the  Connection  Machine  software  environment  has  been  that 
programming  the  Connection  Machine  is  a  relatively  easy  progression  from 
using  Lisp,  and  that  users  can,  within  a  week,  begin  programming  complex 
programs  on  the  Connection  Machine.  The  improvements  in  execution  time 
from  implementation  to  estimated  times  reflect  expected  improvements  in 
micro-code  for  certain  operations  on  the  Connection  Machine,  as  well  as  re¬ 
coding  of  the  algorithms  in  a  low-level  language  (PARIS).  A  compiler  for 
*Lisp  is  being  constructed,  which  will  eliminate  the  necessity  of  re-coding  in 
PA  R  IS.  while  generating  code  which  uses  the  Connection  Machine  efficiently. 


» 


I 

I 

I 


! 


2  Benchmark  Problems 

2.1  Edge  detection 

In  this  task,  assume  that  the  input  m  an  8-bit  digital  image  of  size  512  *  512 
pixels. 

1.  Convolve  the  image  with  an  11x11  sampled  “Laplacian”  operator  [Har- 
alick84].  (Results  within  5  pixels  of  the  image  border  can  be  ignored.) 

2.  Detect  zero-crossings  of  the  output  of  the  operation,  i.e.  pixels  at  which 
the  output  is  positive  but  which  have  neighbors  where  the  output  is 
negative. 

3.  Such  pixels  lie  on  the  borders  of  regions  where  the  Laplacian  is  positive. 
Output  sequences  of  the  coordinates  of  these  pixels  that  lie  along  the 
borders.  (On  border  following  see  [Rosenfeld82],  Section  11.2.2.) 

The  size  of  this  image  requires  4  virtual  processors  per  physical  processor. 
Each  pixel  is  mapped  into  a  virtual  processor. 

2.1.1  Convolution  with  Laplacian 

The  11x11  sample  “Laplacian”  actually  corresponds  to  filtering  with  a  Gaus¬ 
sian  where  o  is  1.4, (  !llaralick84] ,  but  see  [Grimson85],  where  it  is  argued 
that  a  much  larger  mask  should  be  used  for  reliable  results).  But,  for  a  mask 
diameter  of  11  pixels,  the  binomial  approximation  to  the  Gaussian,  followed 
by  a  discrete  Laplacian,  requires  only  3  ms. 

2.1.2  Detecting  Zero-Crossings 

This  takes  negligible  time  (0.05  ms).  Each  processor  need  only  examine  the 
sign  bits  of  neighboring  processors. 

2.1.3  Border  Following 

To  analyze  this  task,  we  consider  two  parameters,  N,  the  number  of  curves  in 
the  image,  and  A/ar,  the  number  of  pixels  on  the  longest  curve.  E<«<  l,  pixel 
in  the  Connection  Machine  can  link  up  with  the  neighbor  pixels  in  tin  i  urve, 
by  examining  its  8-neighbors  in  the  grid,  in  negligible  time  (n  2  ms).  Each 
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pixel  ii-,  the  curve  must  next  be  labeled  with  a  unique  identifier  for  the  curve. 
Doubling  permits  the  pixels  on  the  c  urve  to  selec  t  a  label,  the  address  of  i  he 
principal  processor,  for  the  curve,  and  to  propagate  that  label  throughout  the 
curve  in  Ofiog  Max)  steps. 

Then,  the  total  number  of  curves  can  be  computed  in  350  ps,  by  selecting 
the  principal  processors,  and  enumerating  t hem  using  a  scan  operation.  The 
scan  opt  ration  can  return  the  number  of  curves  (.V). 

At  ■  point  the  curves  have  been  linked,  labeled  uniquely,  and  counted. 
The  s*;  dure  constructed  so  far  is  sufficient  to  support  most  operations  on 
curves  lor  image  understanding,  so  we  can  consider  all  processing  after  t!  is 
to  :<>  1  output  only.  To  output  the  pixels  from  the  Connection  Machine, 
the  points  on  the  curves  should  be  numbered  in  order  to  create  a  stream  of 
conned ed  points.  The  curve-labeling  step,  using  doubling ,  can  be  augmented 
to  re-',  ord  the  distance  from  the  principal  processor ,  as  well  as  its  label,  during 
label  propagation,  at.  only  a  slight  increase  in  message  length.  We  can  find 
the  kuo.'th  of  the  longest  curve,  A  fax,  by  fine  global-max  operation  (200/ns). 

A  s  mp!e  method  suggested  by  Guy  Blelloch  lets  us  assign  to  each  point 
on  a:  edge  an  index,  so  that  the  points  can  be  ordered  in  a  stream  for  output 
from  t. ;  Connection  Machine.  Each  edge  sends  its  length  to  the  processor 
whos<  address  is  the  index  of  the  edge.  Then,  a  plus-scan  on  the  set  of 
prone.-:,  .  <•  representing  these  edge  lengths  generates  the  starting  location,  in 
the  <• 1  ru,  of  the  first  point  in  each  edge.  This  value  is  sent  to  the  first  point 
(t  he  pr:  unpal  point)  in  the  edge,  which  broadcasts  it  to  the  points  in  the  edge, 
using  doubling.  Each  point  constructs  an  index  for  itself  from  its  location  in 
the  edg*-  and  the  stream  location  of  the  first  point.  Ordering  the  pixels  by 
this  met ;  od  takes  log  Max)  ms,  for  doubling ,  two  routing  operations  and 
a  scan 


i  i.i  clered  pixels  then  send  their  (x,y)  values  to  the  address  given  by  the 
rank,  t'li-  takes  on*-  send  operation,  with  no  collisions.  The  (x,y)  coordinates 
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v-  ii  a»'d  on. 


The  total  for  Border  Following  is: 


Propagate  label  and  enumerate  points  1  log  Max  ms 

Enumerate  curves  350/zs 

Rank  pixels  2(log  ,\/<z.r)  •  3 ms 


Send  1  ms 

For  typical  values  in  a  512  x  512  image 

Max  —  512  log  Afar  -  9 

N  =  256  log  N  =  8 

Propagate  label  and  enumerate  points  36ms 

Enumerate  curves  350/zs 

Order  pixels  21  ms 

Send  1ms 


The  first  two  sub-tasks  are  necessary  to  construct  curves  out  of  individual 
pixels.  The  last  two  are  necessary  for  output.  Considering  the  first  two, 
Border  Following  requires  36ms.  The  remaining  time,  to  prepare  for  output, 
is  22ms.  In  total,  approximately  58 ms  is  need  to  perform  Border  Following. 

The  first  two  steps,  Convolution  and  Detecting  Zero  Crossings,  add  neg¬ 
ligible  time  to  this  process,  so  approximately  60ms  will  suffice. 


Edge  Detection 

Sub- task 

Implemented 

Estimated 

Convolution 

3ms 

2  ms 

Find  Zero-Crossings 

0.5ms 

0.5  ms 

Propagate  label 

36m.s 

36  ms 

j 

Enumerate  curves 

350/zs 

350/zs 

Rank  arid  send  pixels 

91  ms 

22m.s 

Total  -  without  Output 

'10ms 

39ms  | 

Totai  -  with  Output, 

1 3 1  ms 

61  ms 

Note:  The  times  quoted  here  are  based  on  a  configuration  of  a  6-1 K  Con¬ 
nection  Machine,  using  a  Virtual  Processor  ratio  of  1:1 


2.2  Connected  component  labeling 

1.  Hero  the  input  is  a  1  -hit  digital  image  of  size  a  12  ■  a  1 2  pixels.  The 
output  is  a  512  y  512  array  of  nonnegative  integers  in  which 

2.  pixels  that  were  0’s  in  the  input  image  have  value  0 

3.  pixels  that,  were  l’s  in  the  input  image  have  positive  values:  two  such 
pixels  have  the  same  value  if  and  only  if  they  belong  to  the  same  con¬ 
nected  component  of  l’s  in  the  input  image  (On  connected  component 
labeling  see  Rosenfeld82j,  Section  11.3.1.) 

A  fast  practical  algorithm  for  labeling  connected  components  in  2-D  im¬ 
age  arrays  using  the  Connection  Machine  has  been  developed  by  Willie  Lim 
l.im8G  .  The  algorithm  has  a  time  complexity  of  Oflog  Ar)  where  .\T  is  the 
number  of  pixels.  The  central  idea  in  the  algorithm  is  that  propagating  the 
largest  or  smallest  number  stored  in  a  linked  list  of  processors  to  all  proces¬ 
sors  in  the  list  takes  O(logL)  time,  where  L  is  the  length  of  the  list,  using 
doubling. 

In  the  algorithm  (see  [Lim86j  for  more  details),  the  label  of  a  connected 
( 1-connected)  component  is  the  largest  processor  address  (i.e.  processor  id) 
of  the  processors  in  the  set.  The  2-D  array  of  processors  in  the  Connection 
Machine  are  numbered  from  left  to  right,  top  to  bottom  fashion.  The  al¬ 
gorithm  first  looks  for  boundary  processors,  i.e.,  processors  which  are  either 
on  the  array  boundary  or  have  at  least  one  neighbor  (8-connected)  with  a 
different  pixel  value.  These  processors  arc  linked  together  to  form  matching 
pairs  of  boundaries  separating  pairs  of  regions.  For  example  if  region  A  is 
completely  surrounded  by  region  0.  then  at  the  border  between  A  and  B 
there  are  two  matching  boundaries  one  on  the  A  side  and  the  other  on  the 
FI  «:d<'  of  the  bord-u.  The  label  of  each  boundary  is  found  in  O(log.V)  time. 


Since  a  region  can  have  more  than  one  boundary  (e.g.  when  it  surrounds 
one  or  more  region),  the  largest  boundary  label  has  to  bo  found.  This  is 
done  by  building  a  tree  of  boundaries  such  that  ouch  boundary  that  is  not 
the  outermost  boundary  of  a  region  is  connected  to  a  boundary  (in  the  same 
region)  to  its  East.  If  the  o  i  more  than  one  boundary  to  its  East,  it  is 
connected  to  the  one  with  the  largest  boundary  label.  Setting  up  this  con¬ 
nectivity  takes  0(log  .V)  time.  The  tree  of  boundaries  is  used  for  joining  up 
the  boundaries  of  the  region  into  one  long  boundary.  In  another  O(logA') 
step,  the  largest  boundary  label,  which  is  also  the  largest  processor  id  in  the 
set,  is  propagated  to  all  the  boundary  processors  in  the  region.  This  label 
which  is  also  the  region  label  ;s  propagated  to  all  the  processors  in  the  region 
in  another  Q(log  N)  step.  Thus  the  whole  algorithm  takes  16  log  /Vms  on  the 
Connection  Machine.  The  c  omplexity  of  this  step  is  measured  in  terms  of  the 
longest  boundary  in  the  image.  If  A’  is  of  the  order  of  512*512,  then  log  N  is 
18.  so  the  estimated  time  for  this  operation  is  300ms  (worst  case;).  When  the 
longest  boundary  is  approximately  512  pixels  long,  the  time  is  150ms.  Note 
that  these  estimates  are  based  on  existing  hardware. 

Another  connected  component  algorithm  by  Guy  Blelloch  utilizes  scan 
operations  along  grid-lines.  In  each  phase  of  his  algorithm,  the  label  of  a 
region,  as  specified  by  the  processor  with  maximum  cube-address,  is  propa¬ 
gated  left,  right,  up  and  down,  with  a  max-scan  operation.  The  number  of 
phases  of  this  algorithm  depends  on  the  alignment  of  figures  in  the  image. 
Its  worst-case  behavior  originates  from  an  image  containing  long  ellipsoidal 
regions,  oriented  along  diagonals.  Present  implementations  require  36ms  per 
phase,  but  expected  rewrites  into  micro-code  will  bring  this  down  to  12ms 
per  phase.  The  number  of  phases  is  commonly  around  12,  which  means  that 
it  also  requires  approximately  150ms  for  a  512  x  512  image. 
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Connected  Component  Labeling 

Method 

Implemented 

Estimated 

Doubling  (length  =  512  x  512) 

— 

300ms 

Doubling  (length  —  512) 

— 

150ms 

Scanning  (12  phases) 

45()ms 

150ms 

Note:  The  times  quoted  here  are  based  on  a  configuration  of  a  64  K  Con 
nection  Machine,  using  a  Virtual  Processor  ratio  of  4:1. 


2.3  Hough  transform 

The  input  is  a  1-bit  digital  image  of  size  512  x  512.  Assume  that 
the  origin  (0,0)  is  at  the  lower  left-hand  corner  of  the  image,  with 
the  x-axis  along  the  bottom  row  The  output  is  a  180x512  array  of 
nonnegative  integers  constructed  as  follows:  For  each  pixel  (x,y) 
having  value  1  in  the  input  image,  and  each  i,  0  <  i  <  180,  add  1 
to  the  output  image  in  position  (ij),  where  j  is  the  perpendicular 
distance  (rounded  to  the  nearest  integer)  from  (0,0)  to  the  line 
through  (x,y)  making  angle  i-degrees  with  the  x-axis  (measured 
counterclockwise).  (This  output  is  a  type  of  Hough  transform;  if 
the  input  image  has  many  collinear  l’s,  they  will  give  rise  to  a 
high-valued  peak  in  the  output  image.  On  Hough  transforms  see 
[Rosenfeld82],  Section  10.3.3.) 

The  solution  to  this  problem  will  involve  180  separate  operations,  each 
of  which  computes  the  Hough  Transform  for  a  particular  angle,  0.  For  each 
angle,  broadcast  cost)  and  sinO  to  each  of  the  processors.  Each  processor  then 
computes  the  scalar  product  of  its  [x,  y)  address  in  the  grid  with  the  normal 
vector  described  by  the  broadcast  pair.  This  number  is  bounded  above  by 
512\/2,  not  512  as  suggested  in  the  problem  description.  This  can,  of  course, 
be  remedied  by  scaling  by  \/2.  Also,  we  can  use  a  clever  trick,  suggested  by 
Mike  Drumheller,  to  reconfigure  the  processors  -  each  computes  its  location 
on  a  linearization  of  the  machine  by  lines  normal  to  the  specified  angle.  Each 
pixel  then  has  a  unique  address,  sequential  along  the  normal  lines,  in  the 
machine.  Each  pixel  can  send  its  value  to  the  processor  with  its  number, 
in  one  router  cycle  (there  are  no  collisions).  The  pixels  then  lie,  in  linear 
order  in  the  machine,  according  to  their  position  on  the  normal  lines.  Each 
processor  at  the  beginning  of  one  of  the  normal  lines  sets  a  segment  bit.  Then 
a  plus-scan  using  segment  bits  accumulates  the  numbers  of  pixels  in  each  line 
for  the  histogram  in  the  processors  with  segment  bits.  One  send  operation 
can  collect  the  values  into  the  histogram.  This  suffices  to  construct  a  column 
of  the  histogram.  Each  angle  requires  some  computation  to 

1.  compute  the  scalar  product 

2.  compute  an  address  along  scan  lines 


One  send ,  followed  by  a  scan,  followed  by  a  send  completes  the  process 
for  a  column.  Each  angle  requires  about  4  ms  (VP  4:1),  and  only  3ms  for 
VP  1:1.  The  entire  Hough  Transform  is  computed  in  approximately  720ms. 
This  estimate  is,  of  course,  based  on  a  512  x  512  image.  For  this  image  size, 
the  Connection  Machine  is  using  a  4:1  VP  ratio,  resulting  in  a  reduction  in 
processing  speed  by  a  factor  of  4  for  most  operations.  For  a  256  x  256  image, 
the  time  for  the  histogram  is  reduced  to  540ms.  The  procedure  describe  here 
uses  unique  addresses  for  the  linearization  step.  There  is  little  penalty  for 
having  up  to  16  collisions  per  destination,  so  a  randomizing  strategy  can  be 
used:  messages  are  sent  to  random  locations  in  a  range  depending  on  the 
normal  distance.  The  messages,  when  they  arrive,  are  summed,  using  the 
send  with  sum  operation. 

Consider  a  Hough  Transform  in  which  edge  fragments  form  the  primitives, 
rather  than  pixels.  Each  edge  point  votes  for  only  one  orientation;  each  point 
generates  an  integer  identifying  its  Hough  Transform  value,  using  no  more 
than  17  bits  (512  x  180).  These  values  are  sorted  in  25 ms,  plus-scanned ,  and 
then  sent  to  the  table.  The  total  is  no  more  than  30ms. 


Hough  Transform 

Method 

Implemented 

Estimated 

Full  180  steps  (512  x  512) 

— 

720ms 

Full  180  steps  (256  x  256) 

— 

540 ms 

From  edge  elements  (512  x  512) 

— 

30  ms 

Note:  The  times  quoted  here  are  based  on  a  configuration  of  a  64K  Con¬ 
nection  Machine,  using  a  Virtual  Processor  ratio  of  4:1. 


2.4  Geometrical  constructions 


The  input  is  a  set  S  of  1000  real  coordinate  pairs,  defining  a  set  of 
1000  points  in  the  plane,  selected  at  random,  with  each  coordinate 
in  the  range  [0,1000’.  Several  outputs  are  required. 

1.  An  ordered  list  of  the  pairs  that  lie  on  the  boundary  of  the 
convex  hull  of  S,  in  sequence  around  the  boundary. 

2.  The  Voronoi  diagram  of  S,  defined  by  the  set  of  coordinates 
of  its  vertices,  the  set  of  pairs  of  vertices  that  are  joined  by 
edges,  and  the  set  of  rays  emanating  from  vertices  and  not 
terminating  at  another  vertex.  (On  Voronoi  diagrams  see 
[Preparata85],  Section  5.5.) 

3.  The  minimal  spanning  tree  of  S,  defined  by  the  set  of  pairs 
of  points  of  S  that  are  joined  by  edges  of  the  tree. 

2.4.1  Convex  Hull 

Each  non-terminating  ray  of  the  Voronoi  Diagram,  described  later,  corre¬ 
sponds  to  an  edge  of  the  convex  hull  of  the  set  of  points.  Generating  the 
ordered  set  of  points  on  the  hull  from  the  Voronoi  diagram  only  requires 
traversing  the  Delaunay  triangulation  along  edges  which  correspond  to  these 
rays,  and  takes  0(11 )  steps,  where  H  is  the  cardinality  of  the  set  of  rays. 
Each  step  involves  following  a  pointer  in  the  Connection  Machine,  less  than 
lm.s. 


An  alternative  method  for  the  convex  hull  calculation  begins  from  (ira- 
harn's  sequential  algorithm  (Preparata85,p. 1 03 i ,  and  does  not  rely  on  the 
underlviag  grid.  Initially,  an  interior  point  is  determined  in  1  extremum 
operations  on  the  Connection  Machine,  finding  the  x  and  y  extrema  of  the 
points.  Kach  point  is  assigned  an  angle  by  constructing  a  vector  from  this 
point  Then  the  points  are  sorted  by  angle  in  20 ms.  Let  us  define  a  convex 
wedge  the  region  formed  by  connecting  a  section  of  the  convex  hull  to 
the  interior  point.  At  first,  the  wedges  are  triangles  formed  from  neighbor¬ 
ing  is  and  the  center  point.  Craham’s  algorithm  recursively  constructs 
convex  wedges  of  size  2 1  by  merging  wedges  of  size  t,  initially  2.  Tilt  outer 
curves  of  these  wedges  can  be  merged  into  new  convex  wedges  in  O(logA’) 
steps  OvermarsSl  .  There  are  O(logA’)  merge  steps,  so  the  overall  computa¬ 
tion  requires  Oflog2  N)  router  operations.  Since  N  -  1000,  log  TV  is  10,  and 
the  whole  process  requires  100ms,  simply  for  the  router  operations.  Other 
computations  may  bring  the  entire  cost  up  to  200ms.  All  computations  are 
in  floating  point.  This  analysis  considers  worst  case. 

A  simple  'Lisp  implementation  of  the  Jarvis  march  algorithm  [Preparata85! 
was  cons!  rue  ted.  In  each  iteration,  each  point  computes  its  slope  from  a  ref¬ 
erence  imint,  which  is  on  the  hull  or  outside  (at  first).  To  compute  the  slope 
needs  two  subtractions  and  one  division.  Lach  step  consists  in  computing 
the  slope,  finding  the  global  minimum  slope,  and  finding  the  point  with  that 
slope  A  simple  implementation  takes  arris  per  step,  which  could  be  reduced 
to  3 ms.  by  re-ceding  in  PARIS.  Trial  examples  with  random  points  had  an 
average  number  of  points  on  the  hull  of  approximately  23.  The  total  time 
required  is  usually  150ms,  which  will  be  reduced  90ms  in  the  PARIS  version. 
This  mf  hod  require-.  3  seconds  if  all  1000  points  were  on  the  hull,  but  it  is 
marg'riT’v  fast*  r  in  the  expected  case. 


2.  1.2  Voronoi  Diagrams 

Aggarwal  et  al.  AggarwalS.5  describe  a  <7(log"  .V)  algorithm  for  computing 
Voronoi  diagrams  in  parallel  using  the  CREW  •  Cora  urn  rt  Head  Exclusive 
Write)  model.  For  this  partu  nlar  'vample.  this  work-  o  it  : . •  1 000  steps, 
each  of  which  will  take  at  least  Im.s.  This  requires  at  I-  :st  I  second  in  total. 
The  algorithm  description  is  sketchy  and  seems  difficult  to  implement.  A 
careful  analysis  might  show  that  this  has  a  high  const, mi  multiplier.  Since 
the  Connection  Machine  has  the  NEWS  network,  a  set  of  .  onnertions 

among  the  processors,  a  brush-lire  method  can  be  easily  implemented  on 
the  Connection  Machine.  The.  points  have  coordinates  in  the  mug,.  0, 1 000  . 
so  the  Connection  Machine  must  use  a  VP  ratio  of  ! « » .  1  io  implement  an 
integer  brush-fire  method.  One  can  argue  that  in  many  vision  appiir  itions 
the  coordinates  of  the  points  are  restricted  to  the  range  of  the  resolut  on  of 
the  camera  coordinate  system,  in  which  case  .312  x  512  is  a  reasonable  ;ange. 
A  VP  ratio  of  1:1  results  from  a  512  x  512  grid. 

Using  the  Euclidean  metric,  and  propagating  the  index  of  the  processor 
containing  the  point,  the  Voronoi  region  around  a  point  can  be  labeled  in 
D  steps,  where  D  is  the  diameter  of  the  largest  Voronoi  region.  The  De¬ 
launay  triangulation,  the  dual  of  the  graph  of  the  Voronoi  diagram,  can  be 
constructed  by  propagating  back  to  the  originator  the  indices  of  all  points 
which  share  a  Voronoi  edge.  This  also  takes  D  steps.  This  can,  of  course,  be 
simplified  by  only  performing  this  back-propagation  step  from  the  Voronoi 
vertices.  Thus,  collisions  can  be  minimized.  Alternatively,  messages  from 
Voronoi  vertices  can  carry  the  neighbor  information  to  the  original  points. 
This  takes  one  router  cycle,  with  an  average  number  of  collisions  of  6.  Propa¬ 
gation  (with  VP  ratio  1:1)  takes  30ms  per  step  in  experiments;  with  coding  in 
PARIS,  or  *  Lisp  compilation,  this  can  be  improved  to  no  more  th<m  1  On  s  per 
step.  With  a  VP  ratio  of  16:1,  a  propagation  step  takes  16()ms  Propagating 
to  all  Voronoi  edges  takes  160/3  ins  (at  16:1),  where  1)  is  the  diameter  of  the 
largest  Voronoi  region.  Trial  examples  with  randomly  distributed  points  in 
the  region  tiad  average  diameter  approximate! v  12.  so  this  step  will  take  less 
than  2  seconds  (16:1),  which  reduces  to  500ms  for  512  ■<  512.  The  addi'ional 
work  to  identify  Voronoi  vertices  and  send  the  information  ah<«ui  connections 
will  take  less  than  1  f'm.s 


2.4.3  Minimum  Spanning  Tree 

Guy  Hl.-lloch  (personal  communication)  has  developed  an  0(2.5  log  .V)  algo¬ 
rithm  for  computing  the  MST  of  a  graph,  where  ,Y  is  the  number  of  vertices 
in  the  graph.  Each  step  in  this  process  requires  approximately  6 ms.  The 
Euclidean  MST  derives  from  the  VD,  so  only  edges  in  the  MST  need  be 
examined.  25  steps  (estimated  for  this  size  graph)  take  150ms.  The  time 
complexity,  concretely,  is  151og  N  ms,  where  N  is  the  number  of  vertices  in 
the  graph. 


Geometric  Constructions 

Sub-task 

Convex  Hull  (from  VD) 
Convex  Hull  (Graham  scan) 


Implemented  Estimated 
'•  50m. s 


200ms 


Colic  ex  Hull  (Jarvis  march) 

150ms 

100ms 

Voronoi  Diagram  (1024  x  1024) 

4  s 

2  s 

Voronoi  Diagram  (512  x  512) 

1  s 

500ms 

Minimum  Spanning  Tree  (from  VD) 

— 

150ms 

_ 

Note.  The  times  quoted  here  are  based  on  a  configuration  of  a  61K  Con¬ 
nection  Machine.  For  the  two  Voronoi  Diagram  methods,  the  Virtual  Proces¬ 
sor  ratios  are  16:1  and  4:1,  and  the  data  points  are  quantized  to  1024  x  1024 
or  512  512  Distance  calculations  are  in  floating  point.  For  the  direct  convex 

hull  (calculations  in  floating  point),  and  minimum  spanning  tree  problems, 
the  \  P  'atio  is  1:1. 


v .  ,!■ .  e 1 -y k .-a-'".  c. - 


2.5  Visibility 


The  input  is  a  set  of  1000  triples  of  triples  of  real  coordinates, 
((r,s,t),  (u,v,w),  (x.y,x)),  defining  1000  opaque  triangles  in  three- 
dimensional  space,  selectee}  a*  random  with  each  coordinate  in 
the  range  [0, 1  OOOj .  The  output  is  a  list  of  vertices  of  the  triangles 
that  are  visib1  from  (0,0,0). 

A  triangle  shadow's  all  vertices  which  lie  in  the  triangular  cone  formed  by 
the  origin  and  the  edges  of  the  triangle,  and  which  are  behind  the  plane  con¬ 
taining  the  triangle.  The  volume  in  space  defined  by  this  criterion  is  described 
by  4  linear  inequalities,  from  the  bounding  half-spaces.  Each  triangle,  in  a 
pre-processing  step,  generates  the  four  plane  equations.  A  vertex  can  then 
be  tested  for  visibility  by  evaluating  these  equations  for  its  (x,y)  coordinates. 
All  vertices  test  whether  they  are  shadowed  by  the  triangle  in  parallel.  The 
time  for  each  triangle  is  approximately  12ms.  Repeating  this  computation 
serially  for  all  1000  triangles  is  obviously  too  expensive. 

The  following  formulation  uses  multiple  copies  of  the  triangles.  The 
problem  can  be  parallelized  by  copying  the  triangles  65  times  in  the  mem¬ 
ory  (64K)  of  the  Connection  Machine.  This  divides  the  machine  into  65 
subsets  of  processors.  Each  triangle  processor  will  handle  up  to  47  points 
(ce?/tny(3000/65)) .  Triangles  0  through  999  occupy  processors  0  through  999 
(cube  address),  and  so  forth.  The  descriptions  of  the  triangles  must  be  gen¬ 
erated.  A  conservative  estimate  of  the  time  for  generating  triangles  is  50ms, 
counting  the  necessary  vector  subtractions  and  cross-products  to  compute 
normal  equations  for  planes.  The  computed  triangle  descriptions  comprise  4 
plane  equations, 

A,  x  +  B,y  +  C,z  +  D  —  0 

each  of  which  contains  4  32-bit  numbers;  the  entire  description  is  512  bits 
long.  The  descriptions  of  all  1000  triangles  can  be  copy-scanned  to  replicate 
them  65  times,  in  15ms,  and  then  sent,  in  one  step,  to  the  correct  processors, 
in  15ms.  Then,  points  are  sent  to  the  sets  of  triangles  against  which  they  arc 
to  be  tested.  The  first  47  points  are  sent  to  processors  0.  ..46,  the  next  47 
to  processors  1000.  .  .3046,  and  so  forth. 


bit-  an  insert ed  at  the  teneiii.it ion  *>!  ea<  q  set  ol  triangles.  , 
t'ach  test  me  step,  the  descript  ion  of  the  point  at  tin'  beginning  of  on*  h  set  of 
points  ts  copy-scanntd  across  the  set.  of  triangles-.,  s running  a  90  hit  ?,2) 

field  taxe-  Tm--.  All  triangles  test  the  aeti\t  Kiint-  in  parallel.  .  [jo/-, 

'L'iien,  t  lie  descriptions  of  tin1  points  are  sen/  |.  ft.  in  Mm.-.  This  hiine  .  mv. 
point  to  the  beginning  of  each  section  ol  triangles,  ready  to  lie  cope  1 
the  Mia  spies  in  the  next  stop  Each  fuli  sn-j  I  sm.'.  Since  f  In  r-  .  17 

steps,  tiie  total  time  required  is  850m  .s. 

An  alternate  formulation  uses  the  grid  stria  tare  of  1  tie  Conti*  •  •  Ma¬ 
chine.  b>  mapping  a  projection  plane.  any  w  here  in  th<  visible  region.  o;  • ,a, 
nal  to  a  line  of  sight  from  t  he  origin,  onto  t  he  250  •  250  grid  of  t  he  ( '< .?> r:.-.  1 
Machine.  More  than  one  vertex  of  a  triangle  may  fall  in  a  parti*  urn-  -ax,  ; 
but,  by  being  careful,  this  can  be  made  to  work.  Next,  the  vert  ire-  of  ;i  <■ 
triangles  generate  lines  in  the  grid,  forming  the  projection  of  the  edge  of  the 
triangles  onto  the  grid,  by  a  standard  vector  to  raster  1  on  version.  S-gme-  * 
bits  are  set  at  these  pixels.  This  step  requires  no  more  than  i  inallv. 

the  projected  vertices  of  triangles  are  distributed  across  the  rows  of  me  grid 
by  a  grtd-srun  operation  using  copy,  stopping  at  the  pixels  containing  pro¬ 
ject*  a  <  dges  of  the  triangles.  Each  time  a  point  encounters  an  edge,  it  checks 
to  see  whether  the  plane  represented  by  the  edge  covers  it.  If  so,  the  point 
turns  off.  and  is  no  longer  handled.  Scan  operations  continue  as  long  as  act! 
points  encounter  edges.  1  he  total  number  of  iterations  is  the  nut  'ber  of  tri¬ 
angles  enclosing,  but  not  covering,  a  point.  Simulations  performed  using  the 
specific*/  number  of  triangles  with  the  given  range  of  coordinates,  randomly 
generated,  showed  that  the  maximum  number  of  triangles  enclosing  hut  not 
covering  a  point  averages  around  200.  Each  seen  operation,  with  a  check  to 
find  whether  tfie  point  is  covered,  requires  no  more  than  5m. s.  The  total, 
approximately  Is,  is  less  than  the  previous  method.  In  addition,  this  method 
dep*  ,  ;  on  the  number  of  triangles  which  overlap  when  projected.  Random 
mput  <1  -pacified  1  >  the  worst  ease  for  this  method;  most  practical  examples 
wiil  ha-.,  maximum  coverings  of  approximat*  [\  10  or  20  triangles. 
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Triangle  Visibility 
Method 
Multiple  copies 
Scanning 


Implemented  Estimated 
—  850m.s 


Note:  The  times  quoted  here  are  based  on  a  configuration  of  a  6-1 K  Con 
nection  Machine,  using  a  Virtual  Processor  ratio  of  1:1.  All  numerical  c.alou 
lations  are  floating  point. 
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2.6  Graph  matching 


The  input  is  a  graph  G  having  100  vertices,  each  joined  by  an 
edge  to  10  other  vertices  selected  at  random,  and  another  graph 
//  having  30  vertices,  each  joined  by  an  edge  to  3  other  vertices 
selected  at  random.  The  output  is  a  list  of  the  occurrences  of 
(an  isomorphic  image  of)  II  as  a  subgraph  of  G.  As  a  variation 
on  this  task,  suppose  the  vertices  (and  edges)  of  G  and  II  have 
i'va!- valued  labels  in  some  bounded  range;  then  the  output  is  that 
occurrence  (if  any)  of  H  as  a  subgraph  of  G  for  which  the  sum  of 
the  absolute  differences  between  corresponding  pairs  of  labels  is 


a  minimum. 


This  task  (subgraph  isomorphism)  is  known  to  be  NP-complete.  As  such, 
we  can  expect  the  worst-case  behavior  of  any  (present)  solution  to  be  expo¬ 
nential  in  the  size  of  the  graph  G.  The  graphs  in  this  particular  problem  are 
uniform  in  degree,  so  that  any  vertex  in  H  can  match  with  any  vertex  in  G, 
based  only  on  degree.  Most  heuristics  for  this  problem  rely  on  non-uniformity 
of  the  degrees  of  vertices  in  the  graphs,  and  so  will  fail  for  this  instance  of 
t he  pi oblern. 

For  ibis  particular  example,  Carl  Feynman  implemented  a  program  to  test 
for  subgraph  isomorphism  on  random  graphs  having  the  specified  structure. 
His  prog:  am  ran  for  17  hours  on  a  Symbolics  3640  Lisp  Machine,  had  found 
13,000  solution  matchings,  and  had  explored  10  8  of  the  search  space,  from 
which  he  conjectured  that  there  were  1012  solutions  for  this  pair  of  random 
graphs  having  the  required  characteristics.  In  the  theory  of  random  graphs 
Boiloba.- !  085, ,  threshold  functions  describe  that  the  probability  of  finding 
a  matching  given  the  sizes  and  degrees  of  two  graphs.  For  graphs  of  the 
specified  sizes  and  degrees,  this  theory  indicates  that  there  is  a  matching 
with  probability  one,  in  other  words,  there  are  many  candidate  matches. 
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We  will  outline  a  method  to  distribute  the  matching  process  among  the 
processors  of  the  ( 'onnection  Machine  A  similar  sedition  |<>r  ob-*-ctioi>  m-c og- 


n  it  ion  is  described  in  llar:is86  .  1  he  method  vw:| 
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hit  par¬ 


ticular  size  of  graph,  but  is  general  <  nmigh  to  used  tor  am  si/e-u  \  matching 
is  a  mapping  p  of  vertices  it  m  it  to  vert  aa  s  in  G\  ■  :ch  that ,  if  two  vertict  -. 
h,  and  h } ,  in  II  are  connected  in  II.  their  images,  //(/).)  and  /o!.;)  are  con¬ 
nected  in  G.  A  matching  will  repi*  settled  as  a  table,  index*  d  by  1  .  ..  //  . 
containing  tee  indites  of  vertices  in  * or  0.  to  indi*ate  no  match.  I'lie  size 
of  a  matching  is  the  number  of  nor,. zero  entries  in  tin  table.  A  successor 


of  a  matching  is  a  net,  mat  itiip 
into  a  vertex  in  <1.  which  *  pr 


which 


a, re  vertex  it:  II  is  mapped 
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location  of  processors  to  matchings.  A  partial  mat*  lung  is  contained  it  each 
active  processor.  At  «•;♦*  !:  step  m  th*  graph  matching  algorWun.  a  matching 
(processor)  acquires  :  he  m'orm.-.-  i*v  u'a  essary  to  •  >-t «'rr*ii r>«'  all  legal  succes¬ 
sors.  It  then  finds  processors  m  continue  with  the  new  matchings:  it  i«  then 
returned  to  th.e  pool  of  free  processors. 

The  descriptions  of  the  graphs  can  be  stored  in  several  ways  in  the  Con¬ 
nection  Machine.  Since  (',  is  l()o,  7  bi’s  are  needed  to  reference  an  entry  in 
G.  The  adjacency  list  of  each  vertex  is  then  70  bits  long,  storing  explicitly 
each  reference.  Since  G  is  100.  the  entire  graph  requires  7000  bits,  more 
than  the  current  Connection  Mac  :  me  provides.  Alternatively,  we  can  use  a 
distributed  representation  of  G,  where  the  adjacency  list  of  each  vertex  in 
G  stored  in  a  different  processor  as  a  100-bit  vector.  Then,  a  matching  pro¬ 
cessor  can  get  the  information  by  using  a  send  operation,  to  the  processor 
with  the  data.  The  vertices  in  G  can  be  stored,  with  many  copies,  through¬ 
out  the  Connection  Machine.  This  means,  with  6'?K  processors,  that  there 
will  be  approximately  655  copies  of  the  graph,  one  for  every  100  matchings. 
Mach  matching  processor  can  access  these  copies  randomly,  so  that  contention 
among  the  processors  is  minimized.  The  address  of  the  vertex  neighbor  list 
for  vertex  G\  needed  by  a  matching  can  be  calculated  from  the*  address  of  the 
matching  processor  and  a  random  variable.  II  is  30,  necessitating  5  bits  to 
reference  a  vertex  in  II .  Each  vertex  has  degree  3,  so  t  fie  complete  description 
of  graph  II  only  requires  30  x  3  x  5  '150  bi  A  matching  needs  to  record 

for  each  vertex  in  If  the  mate!  d  vertex  in  G',  so  it  needs  30  •  7  -  210  bits. 
Each  matefling  processor  e.ontnii  script, am  of  II  as  well  as  the  partial 

mat  <  tnng  i  t  is  ex  pa  nd  i  rig. 


1-  igure  \ . -i 1 1  h  hxpansion  in  <  itapli  Matching 


in  uroi.(-~-i  ■  -  are  allocated  We  use  rendezvous  allocation  il'dii-*.') 
are -son-,  U:  matchings.  The  order  in  which  vertices  in  //  arc 
••  -i  be  pro-computed  to  maximize  the  number  of  vertices  " 
<>  t he  ’cxl  vertex  to  be  expanded.  in  that  way.  maximum  ■■ 
>c  applied  at  ear  h  step.  Consider  a  tableau  (figure  n.  in  v\  hid. 
)f  G  arc  arranged  ieft-to -right  across  the  top,  and  the  vc  rices  til 
•;<-d  ‘  u  -tc)  i.ottom  on  the  let’:.  We  represent  matching  \  •  e.\  h, 

n  m*  r>  ••!  i  r«)w .column i  (i.  / !  in  *he  tableau.  A  partial  mam  I, 
led  ''o'  tn-  partial  mat'  him:  i  r>  t  he  column  above  it.  Sea  re1: 
1  <)•  i  • !.  t: :  I  -  1 1  to- right  fashior-.  Knougli  free  processor  linn  t 

a.  p  •  •  •  ■  ••  .it  a!'  e  xpanding  seat.  :t  nodes  ran  com;  '•  m  . 

‘  i  •  sear<  ft  sub  t  ree  to  leaf  nodes 
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In  each  phase  of  matching  generation,  a  matching  at  level  k ,  1  <  k  <  //  , 
must  expand  itself  to  all  legal  successor  matchings  at  t  he  next  level.  Matching 
processors  may  be  expanding  at  many  different  levels,  since  resource  limita¬ 
tions  may  delay  expansion  until  some  processor  fails,  and  is  returned  to  the 
pool.  To  expand  itself,  a  matching  must  know,  first,  the  neighbors  of  /)*.], 
and,  second,  the  vertices  in  (7  to  which  those  neighbors  have  been  matched. 
These  data  allow  a  matching  at  level  k  to  prune  its  expansion,  generating  only 
legal  successors.  The  description  ol  il  is  stored  locally  in  each  processor.  To 
recover  the  neighbors  of  hk^  ,,  each  processor  steps  through  the  description 
of  H,  until  it  encounters  the  k  •-  l"1  entry,  and  then  records  the  contents  of 
this  entry.  This  takes  no  longer  than  3ms.  This  step  tlnds  the  neighbors  of 
the  new  vertex  in  //. 

Each  expanding  matching  examines  the  neighbors  of  hk.  j  to  determine 
the  nodes  in  G  to  which  they  have  been  matched.  The  neighbors  of  each 
such  vertex  in  G  must  be  retrieved  from  the  distributed  representations  of  G . 
using  a  send.  operation.  The  adjacency  information  in  G  is  stored  as  100-bit 
vectors.  Retrieving  this  information  needs  5 ms  per  vertex,  so  15ms  total 
may  be  required  for  the  three  possible  neighbors  of  hk,  j.  Now,  we  must 
compute  the  intersection  of  these  bit  vectors,  describing  all  possible  nodes  in 
G  adjacent  to  the  matches  in  G  of  neighbors  of  hki. This  can  be  done  in 
time  linear  in  the  number  of  nodes  in  G,  but  such  bit  operations  arc  fast; 
the  total  time  for  graphs  of  this  size  is  estimated  to  be  less  than  3ms.  Then 
we  exclude  from  the  intersection  all  nodes  already  matched  in  the  current 
matching,  leaving  the  possible  expansions  in  G.  This  is  another  fast,  logical 
AND  NOT  operation  on  ihe  bit  vector,  taking  less  than  lms.  The  remaining 
vertices  are  the  possible  expansions  in  G.  All  are  legal,  that  is,  the  nodes  in 
G  to  be  matched  are  unmatched,  and  are  adjacent  to  existing  constraining 
matches  from  H.  If  this  set  is  empty,  the  matching  fails.  The  entire  phase  of 
computing  the  possible  successors  needs  no  more  than  25 ms. 
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2.7  Minimum-cost  path 


The  input  is  a  graph  G  having  1000  vertices,  each  joined  by  an 
edge  to  100  other  vertices  selected  at  random,  and  where  each 
edge  has  a  nonnegative  rcai-val^d  weight  in  some  hounded  range. 

Given  two  vertices  P,Q  of  G,  the  problem  is  to  find  a  path  from 
P  to  Q  along  which  the  sum  of  the  weights  is  minimum. 

The  graph  can  be  represented  as  an  adjacency  list  in  the  Connection 
Machine.  The  algorithm,  a  Connection  Machine  implementation  of  Dijkstra’s 
algorithm,  is  given  in  f II illis85] .  Each  step  in  computing  the  shortest  path 
consists  in  each  vertex  sending  to  each  of  its  neighbors  the  distance  from 
the  source  to  itself  plus  the  length  of  the  connecting  edge  along  which  the 
message  is  sent.  With  this  number  of  vertices  and  edges,  there  are  more  edges 
(100,000)  than  the  number  of  processors,  so  virtual  processors  will  be  used, 
at  the  ratio  of  2:1.  Each  step  involves  a  send  operation,  using  the  router. 
The  receiver  compares  all  incoming  values  and  selects  the  minimum. 

Messages  are  sent  only  when  the  distance  from  the  source  is  less  than  in¬ 
finity  (some  initial  value  for  all  processors).  This  reduces  the  number  of  con¬ 
flicts  at  many  stages.  Initial  experiments  require  9 ms  per  step  and  analysis 
indicates  that  5 ms  per  step  is  possible  to  achieve.  The  number  of  steps  de¬ 
pends  on  the  diameter  (the  length  of  the  longest  path  in  the  graph  explored). 
The  algorithm  stops  when  no  processor  changes  its  value  as  the  result  of  the 
messages  it  has  received.  For  this  particular  problem,  with  such  high  degree 
of  interconnection,  the  number  of  steps  will  be  around  10,  resulting  in  an 
overall  time  to  completion  of  approximately  50ms.  The  implementation  and 
experiments  were  performed  by  Mike  Drumheller. 


Minimum  Cost  Path 

Method 

Implemented 

Estimated 

90ms 

50ms 

Note:  The  times  quoted  here  are  based  on  a  configuration  of  a  61K  Connec¬ 
tion  Machine,  using  a  Virtual  Processor  ratio  of  1:1. 
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Task 

Implemented 

Estimated 

Edge  detection 

...  . 

Convolution 

3  ms 

2  ms 

Find  Zero-Crossings 

0.5  ms 

0.5ms 

Propagate  label 

36  ms 

36  ms 

Enumerate  curves 

350/rs 

3  50/os 

Rank  and  send  pixels 

91ms 

22  ms 

Total  -  without  Output 

40ms 

39  ms 

Total  -  with  Output 

131ms 

61  ms 

Connected  Component  Labeling 

Doubling  method  (length  =  512  X  512) 

300ms 

Doubling  method  (length  —  512) 

1 50ms 

Scan  method  (12  phases) 

450ms 

150ms 

Hough  Transform 

Full  180  steps  (512  X  512) 

— 

720ms 

Full  180  steps  (256  x  256) 

— 

540ms 

From  edge  elements  (512  x  512) 

— • 

30  ms 

Geometric  Constructions 

Convex  Hull  (from  VD) 

— 

50ms 

Convex  Hull  (Graham  scan) 

— 

200ms 

Convex  Hull  (Jarvis  march) 

150ms 

100ms 

Voronoi  Diagram  (1024  x  1024) 

4s 

2s 

Voronoi  Diagram  (512  x  512) 

Is 

500ms 

Minimum  Spanning  Tree  (from  VD) 

— 

150ms 

Triangle  Visibility 

Multiple  copies 

— 

850ms 

Scanning 

— 

1.0  s 

Graph  Matching 

Per  expansion  step 

— 

50  ms 

Minimum  Cost  Path 

90ms 

50ms 

Figure  6:  Summary  Table 
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